The goal of our project was to establish a link between hygiene inspections and the ratings of different restaurants. The idea was simple: we found a database of hygiene inspections for restaurants in Manhattan, and we decided to scrape the corresponding reviews from TripAdvisor to see whether we could relate the two.
Our assumption was that certain types of restaurants would see their ratings affected differently by a bad hygiene inspection. We thought the effect would depend mainly on location, price and type of cuisine. Basically, a cheap restaurant should be less affected than an expensive one, since people who are not paying much presumably care less; conversely, when people pay a lot to eat somewhere, it seems logical that they expect high quality and irreproachable hygiene. Cuisine type and location were expected to play a similar role, since (at least in the US) some neighbourhoods are poorer than others and some types of cuisine are perceived as cheaper or fancier.
We are going to present the process that led us to our findings. First we will discuss the different methods we used and present them along with our R code; finally we will present our findings and conclude.
In this project we used quite a few concepts and methods that were new to us, so it seemed worth having a methodology section to present them briefly.
The first thing we needed to do was scrape the data from TripAdvisor. To do so we needed to understand a little bit of HTML and CSS, but thanks to our web browser's developer tools we can simply highlight the object we want (for instance the restaurant name) and the tools show us the relevant div. We then needed R to fetch the information we wanted from the pages we wanted. For that we used RSelenium together with Docker: Docker runs an emulated web browser that R can drive through Selenium commands, which lets us retrieve the HTML and the information it contains. Note that each time we scraped, we had to start the Docker Selenium server from the terminal; we come back to the problems and technical details later. We went this route because the TripAdvisor API did not allow us to register, since we are not a business.
After that we found ourselves with two databases. The one we "created" was really messy, so we had to clean it extensively. Beyond basic transformations, such as converting dates into a date format and computing an average-rating column, the most interesting tools we used in this part were probably the Google geocoding API and the `nest` function from the tidyverse. `nest` basically lets you put data frames inside a data frame (strictly speaking, it creates a list-column of data frames, but the idea is there). This was really helpful for us, since we had one line per restaurant but hundreds of reviews for each. So we nested the two data frames and joined them, which was no easy task because of differences in the restaurant names. At this point we wanted to plot the scraped restaurants on a map. We had the addresses, but to place them on a map we needed coordinates. The `geocode` function retrieves them from Google, but when we first collected the data some coordinates were randomly missing. To solve that, we registered a Google API key, which granted us a quota of requests to the API; it worked like a charm. For this we also needed the devtools package, so that we could install a development version of ggmap whose `register_google` function lets us register the key. After all that, we finally had our data.
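As a toy illustration of the idea (invented restaurant names and ratings, not our actual data), `tidyr::nest()` collapses the per-review rows of each restaurant into a single list-column cell:

```r
library(tidyr)
library(dplyr)

# toy data: two restaurants, three reviews in total
reviews <- tibble(
  restaurant = c("A", "A", "B"),
  rating     = c(5, 4, 3)
)

nested <- reviews %>% group_by(restaurant) %>% nest()
nrow(nested)           # 2 -- one row per restaurant
nrow(nested$data[[1]]) # 2 -- restaurant A's reviews live in the `data` list-column
```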
Now it was time to explore our data and plot some graphs and maps. First we drew some basic plots, nothing fancy, but they let us see clearly what we had. Then we used the leaflet package to put the restaurants on an interactive map where we could display them by category or hide some of them. The plotly package let us do the same for the time-series plots (average rating vs. date). For the time series we also needed to pad the series, since obviously the reviews do not arrive on a regular basis.
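Padding can be illustrated on a toy series (made-up dates, not our scraped data): `padr::pad()` inserts the missing days so the series becomes regular.

```r
library(padr)
library(tibble)

# toy series: two reviews four days apart
ratings <- tibble(
  date   = as.Date(c("2018-01-01", "2018-01-04")),
  rating = c(4, 5)
)

padded <- pad(ratings, interval = "day")
# the two missing days are inserted with NA ratings
nrow(padded)  # 4
```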
In the end we decided to present our project as a website, hosted on GitHub Pages. We thus had to learn, once again, a bit of HTML and CSS, and to use Jekyll. Jekyll let us serve the website locally, which was more convenient during development; it too is installed and run from the terminal.
We also want to mention that, because of the scraping, we were forced to drop some restaurants during cleaning since some data were missing. In the end we had complete data for a bit more than 300 restaurants.
We used quite a lot of functions, packages and concepts, which may seem a little messy; the code following this section goes into more detail, with comments explaining what each command does.
These are all the packages we used:
library(rvest)
library(RSelenium)
library(wdman)
library(XML)
library(plyr)
library(tidyverse)
library(ggmap)
library(lubridate)
library(leaflet)
library(devtools)## to install new ggmap package and use google api
library(knitr)
library(padr)
library(plotly)
library(broom)
library(kableExtra)
To execute this code you first need to install Docker on your computer and start the Selenium server from the terminal with `docker run`. After that you can access the emulated web browser with the following commands:
remDr <- remoteDriver(remoteServerAddr = "localhost", port = 4445L,
browserName = "chrome")
remDr$open()
remDr$navigate("http://www.google.com") # navigate to the Google web page
remDr$getTitle() # If it returns "Google", the server is operational
The first thing we did was scrape the name, the type of cuisine and price, and the page URL of every restaurant we could. The problem was that at some point we hit the query limit, so we could not take them all, and some rows were wrong because RSelenium has no ad blocker, so we occasionally picked up data from the ads. Also, since TripAdvisor's restaurant listing is built around an interactive box, we had to click through to the next page every 30 restaurants. Still, with the following code we were able to scrape around 1,800 restaurant names and web pages.
# Creation of the data frame which will store the scraped data
df <- data.frame(name = character(), cuisine = character(), page_url = character(), stringsAsFactors = FALSE)
# scrape
pb = txtProgressBar(min = 0, max = 1, style = 3) # progress bar to visualize the progression of the for loop
for(i in 1:100){
  overview <- remDr$findElement(using = "css", value = "[class='deckA eatery_overview']")
  ## read the names
  name.elems <- overview$findChildElements(using = "css", value = "[class='property_title']")
  for(j in 1:length(name.elems)){
    name.elem <- name.elems[[j]]
    df[j + (i - 1) * length(name.elems), 1] <- name.elem$getElementText()
  }
  ## read the cuisine attributes
  cuisine.elems <- overview$findChildElements(using = "css", value = "[class='cuisines']")
  for(k in 1:length(cuisine.elems)){
    cuisine.elem <- cuisine.elems[[k]]
    df[k + (i - 1) * length(cuisine.elems), 2] <- cuisine.elem$getElementText()
  }
  ## read the page URL of each restaurant
  url.elems <- overview$findChildElements(using = "css", value = "[class='property_title']")
  for(o in 1:length(url.elems)){
    url.elem <- url.elems[[o]]
    rhs <- sub(".*' *(.*?) *'.*", "\\1", url.elem$getElementAttribute("onclick"))
    df[o + (i - 1) * length(url.elems), 3] <- str_c("https://www.tripadvisor.com", rhs, sep = "")
  }
  ## go to the next page
  webElems <- remDr$findElement(using = 'css', value = "#EATERY_LIST_CONTENTS > div.deckTools.btm > div > a.nav.next.rndBtn.ui_button.primary.taLnk")
  webElems$clickElement()
  Sys.sleep(2)
  setTxtProgressBar(pb, i)
  # pause for a second every 10 pages to try not to go over the query limit
  if (i %% 10 == 0){Sys.sleep(1)}
}
Now that we had the page of each of these restaurants, we needed to get all their reviews and review dates. The following code works the same way as the previous one, except that the pages it visits are given by our first scrape. Due to problems similar to those of the first code, we were only able to get around 500 restaurants with full reviews.
# we encountered some errors during the scrape, so we remove all rows with NA values
df <- na.omit(df)
# create the new columns in the data frame to store each review with its date, plus the restaurant address
df$rate <- list(NA)
df$rate.date <- list(NA)
df <- mutate(df, address = "")
# the first row is an advertisement on the TripAdvisor page; it leads to a different page URL which our code can't scrape, so we assign a valid address to start
df$page_url[[1]] <- "https://www.tripadvisor.com/Restaurant_Review-g60763-d10145683-Reviews-Dominique_Bistro-New_York_City_New_York.html"
pb = txtProgressBar(min = 0, max = length(df$page_url), style = 3) # setting up the progress bar
for(m in 1500:length(df$page_url)){ # adjust the range to the chunk of restaurants you want to scrape
  setTxtProgressBar(pb, m)
  remDr$navigate(df$page_url[[m]])
  Sys.sleep(3)
  remDr$screenshot(display = TRUE) # to see which restaurant is currently being scraped
  remDr$getTitle()
  df1 <- data.frame(rating = integer(), date = character(), stringsAsFactors = FALSE)
  # get the number of pages containing reviews (internal pages of the restaurant page)
  pagenumbers <- remDr$findElement(using = 'css', value = "[class='pageNum last taLnk ']")
  pagenumbers <- as.numeric(pagenumbers$getElementAttribute("data-page-number"))
  # scrape the restaurant's address
  address <- remDr$findElement(using = 'css', value = "[class='street-address']")
  address <- address$getElementText()
  df$address[[m]] <- address[[1]]
  # scrape each review with its date
  pb1 = txtProgressBar(min = 0, max = pagenumbers, style = 3)
  for(q in 1:pagenumbers){
    list.container <- remDr$findElement(using = 'css', value = "[class='listContainer ']")
    rateanddate.elems <- list.container$findChildElements(using = 'css', value = "[class='rating reviewItemInline']")
    for(w in 1:length(rateanddate.elems)){
      rateanddate <- rateanddate.elems[[w]]
      rating.elem <- rateanddate$findChildElement(using = 'css selector', value = "div > *:first-child")
      df1[w + q * 10 - 10, 1] <- rating.elem$getElementAttribute("class")
      df1[w + q * 10 - 10, 2] <- rateanddate$getElementText()
    }
    if (q < pagenumbers){
      webElems <- remDr$findElement(using = 'css', value = "[class='nav next taLnk ui_button primary']")
      webElems$clickElement()
      Sys.sleep(3)
    }
    setTxtProgressBar(pb1, q)
  }
  # copy the scraped values into the data frame
  df1 <- as.list(df1)
  df$rate[[m]] <- df1$rating
  df$rate.date[[m]] <- df1$date
  # pause for 3 minutes every 5 restaurants so the server doesn't crash
  if (m %% 5 == 0){Sys.sleep(180)}
  print(m)
}
At this point it was important to save the scraped data a first time, before cleaning, so we wouldn't have to scrape again; scraping takes a lot of time.
## save data frame :
save(df,file="TA_db.Rda")
In the data cleaning part we aim to simplify our database and store all the information efficiently so we can exploit it. Since we have two datasets, we are also going to join them.
We simply load the data we scraped (the saved object is called `df`, so we rename it to `TA.db` for the cleaning steps):
load("TA_db.Rda")
TA.db <- df
First we put our scraped variables into useful formats: everything that is supposed to be text is converted to character, and the dates are stored in date format using `mdy()`. One problem was that TripAdvisor writes reviews from roughly the last month in a format we cannot exploit (something like "2 weeks ago"). Since `mdy()` turned these entries into NA, we decided to simply delete them.
#transform strings as character
TA.db <- TA.db %>% select(-1) %>% mutate(name = as.character(name),cuisine = as.character(cuisine),page_url = as.character(page_url))
#Convert date column into date format
for(i in 1:length(TA.db$rate.date)){
  TA.db$rate.date[[i]] <- mdy(TA.db$rate.date[[i]])
}
## The most recent review dates were written like "2 days ago"; they can't be converted to date format and end up NA after the previous step, so we remove them
### define a function that removes NA entries from a list
na.omit.list <- function(y) { return(y[!sapply(y, function(x) all(is.na(x)))]) }
for (j in 1:length(TA.db$rate.date)){
  TA.db$rate[[j]] <- TA.db$rate[[j]][!is.na(TA.db$rate.date[[j]])]
  TA.db$rate.date[[j]] <- na.omit.list(TA.db$rate.date[[j]])
}
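A quick illustration of the problem we ran into (toy strings): `lubridate::mdy()` parses absolute dates but returns `NA` (with a warning) for relative ones, which is how these entries reveal themselves.

```r
library(lubridate)

dates <- mdy(c("March 22, 2018", "2 weeks ago"))  # the second string fails to parse
is.na(dates)
# [1] FALSE  TRUE
```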
The rating information on TripAdvisor is stored in the HTML as a "bubble rating" class, because the site displays ratings as bubble images. We therefore needed to convert the bubble class text to a numerical value so that we could later compute averages over time.
#modify the bubble rating into an integer (1-5)
pb1 = txtProgressBar(min = 0, max = length(TA.db$rate), style = 3)
for(j in 1:length(TA.db$rate)){
  x <- TA.db$rate[[j]]
  for(i in 1:length(x)){
    x[i] <- str_replace(x[i], "ui_bubble_rating bubble_50", "5")
    x[i] <- str_replace(x[i], "ui_bubble_rating bubble_40", "4")
    x[i] <- str_replace(x[i], "ui_bubble_rating bubble_30", "3")
    x[i] <- str_replace(x[i], "ui_bubble_rating bubble_20", "2")
    x[i] <- str_replace(x[i], "ui_bubble_rating bubble_10", "1")
  }
  x <- as.numeric(x)
  TA.db$rate[[j]] <- x
  setTxtProgressBar(pb1, j)
}
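The same conversion could arguably be done without the loop: the bubble class always ends in the rating times ten, so a single vectorised `str_extract()` would do (a sketch, assuming the class names keep that pattern):

```r
library(stringr)

bubbles <- c("ui_bubble_rating bubble_40", "ui_bubble_rating bubble_50")
as.numeric(str_extract(bubbles, "\\d+")) / 10
# [1] 4 5
```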
Since the restaurant's price range was scraped together with its type of cuisine, we need to split the string to keep only the price. Doing so, we obviously lose the cuisine information, but it is also present in the other database, so we will use that one instead.
#to be sure that we catch the first capital letter
TA.db$cuisine <- str_to_upper(TA.db$cuisine)
for(i in 1:length(TA.db$cuisine)){
  TA.db$cuisine[[i]] <- str_extract(TA.db$cuisine[[i]], "[^[A-Z]]+")
}
#Setting the price as a factor
TA.db$cuisine <- as_factor(TA.db$cuisine)
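On a toy string (dollar signs here for illustration; the scraped strings used a currency symbol followed by the cuisine), the regex keeps everything up to the first capital letter, i.e. the price symbols:

```r
library(stringr)

str_extract(str_to_upper("$$ - $$$ french, european"), "[^[A-Z]]+")
# [1] "$$ - $$$ "
```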
We add a column with the number of reviews per restaurant, which seemed like important information to have.
#Creation of the column storing the number of reviews
TA.db <- mutate(TA.db, nbr_of_review = 0)
for(i in 1:length(TA.db$rate)){
  TA.db$nbr_of_review[[i]] <- length(TA.db$rate[[i]])
}
To plot the time series we needed the average rating over time. The following code provides it: for each restaurant it takes the `cumsum()` of the ratings up to each date, divides by the number of reviews up to that date, and stores the result in a new column.
#Creation of the column
TA.db$average <- list(NA)
for(i in 1:length(TA.db$rate)){
  ro <- 1:length(TA.db$rate[[i]])
  TA.db$average[[i]] <- cumsum(rev(TA.db$rate[[i]])) / ro
}
# the ratings are stored newest first, so we reverse the running averages back to that order
for(i in 1:length(TA.db$rate)){
  TA.db$average[[i]] <- rev(TA.db$average[[i]])
}
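A worked toy example of the two steps (ratings invented): the reviews are scraped newest first, so we reverse, take the cumulative mean, and reverse back.

```r
rates <- c(5, 3, 4)  # newest first, as scraped; chronological order is 4, 3, 5
ro    <- seq_along(rates)
avg   <- rev(cumsum(rev(rates)) / ro)
avg
# [1] 4.0 3.5 4.0  -- e.g. the newest entry carries the mean of all three ratings
```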
We then saved the result before nesting the database, so we could go back to a clean scrape if needed.
save(TA.db,file = "clean_TA_db.RDA")
We then realised it would be better to have nested data. So we first artificially build a data frame as if the scrape had produced one row per review (with its date and running average, and the restaurant-level information repeated on each row), and then nest it. (Warning: this takes a long time to run.)
# Creating the artificial data frame
df <- tibble(Name = character(),
             Price = character(),
             Page_url = character(),
             Address = character(),
             Rate = integer(),
             Rate_date = as.Date(character()),
             Nbr_of_review = integer(),
             Average = double())
pb1 = txtProgressBar(min = 0, max = length(TA.db$name), style = 3)
for(i in 1:length(TA.db$name)){
  k <- length(df$Name)
  for(j in 1:length(TA.db$rate[[i]])){
    df[j + k, 1] <- TA.db$name[[i]]
    df[j + k, 2] <- as.character(TA.db$cuisine[[i]])
    df[j + k, 3] <- TA.db$page_url[[i]]
    df[j + k, 4] <- TA.db$address[[i]]
    df[j + k, 5] <- TA.db$rate[[i]][[j]]
    df[j + k, 6] <- TA.db$rate.date[[i]][[j]]
    df[j + k, 7] <- TA.db$nbr_of_review[[i]]
    df[j + k, 8] <- TA.db$average[[i]][[j]]
  }
  setTxtProgressBar(pb1, i)
}
## nest the data
nest_df <- df %>% group_by(Name,Price,Page_url,Address,Nbr_of_review) %>% nest()
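With a recent tidyr (>= 1.0, an assumption on our side), the slow row-by-row loop above could likely be replaced by unnesting the parallel list-columns directly; a toy sketch with invented data:

```r
library(tidyverse)

toy <- tibble(
  name      = c("A", "B"),
  rate      = list(c(5, 4), 3),
  rate.date = list(as.Date(c("2018-01-01", "2018-01-02")), as.Date("2018-02-01"))
)

long <- toy %>% unnest(c(rate, rate.date))  # one row per review
nrow(long)  # 3
```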
We load the hygiene data and nest it too, since each restaurant has several inspections.
control <- read_csv("./data/DOHMH_New_York_City_Restaurant_Inspection_Results .csv")
nest_control <- control %>% group_by(CAMIS,DBA,BORO,BUILDING,STREET,ZIPCODE,PHONE,`CUISINE DESCRIPTION`) %>% nest()
We now wanted to join the two data frames. At first we had very few matches, so we proceeded by iteration, harmonising the data as we went. First, the join columns needed the same names. Then we realised that some names were in capitals and some contained odd spacing, so we removed all spaces and put everything in capital letters. The addresses also had problems: some said SOUTH STREET where the other database said just S, and in one database street numbers carried an ordinal suffix (e.g. 4TH) while in the other they did not. In the restaurant names, some words appear in one database but not the other (for instance BAR at the end of a name), so we stripped a few such uninformative words. We fixed almost all of the problems this way, and what the code does should be fairly self-explanatory. Some problems remained because they were one-restaurant exceptions, and it made no sense to write code for a single restaurant. Also, some restaurants we scraped were simply absent from the hygiene database. In the end, out of roughly 500 restaurants we were able to keep about 300.
colnames(nest_df)[1] <- "DBA" ##rename col of nest df to the same name as nest control
address <- as.data.frame(str_split_fixed(nest_df$Address," ",2)) ## split the address of nest df in 2 to be like nest control
nest_df <- nest_df %>% mutate(Address = address$V2, BUILDING = address$V1) ## add to nest df the the 2 "new" col
colnames(nest_df)[4] <- "STREET" ## rename the "new" col that wasn't already right
## transform nest_df DBA to capitals
for (i in 1:length(nest_df$DBA)){
  nest_df$DBA[[i]] <- str_to_upper(nest_df$DBA[[i]])
}
## turn nest_df STREET into a character vector rather than a list
nest_df$STREET <- lapply(nest_df$STREET, as.character)
## transform nest_df STREET to capitals
for (i in 1:length(nest_df$STREET)){
  nest_df$STREET[[i]] <- str_to_upper(nest_df$STREET[[i]])
}
## transform nest_control STREET to capitals
for (i in 1:length(nest_control$STREET)){
  nest_control$STREET[[i]] <- str_to_upper(nest_control$STREET[[i]])
}
## transform nest_control DBA to capitals
for (i in 1:length(nest_control$DBA)){
  nest_control$DBA[[i]] <- str_to_upper(nest_control$DBA[[i]])
}
## unlist nest df to be a char vector
nest_df$STREET <- unlist(nest_df$STREET)
## add the ordinal suffix after street numbers in nest_control's STREET, plus other harmonisations
for(i in 1:length(nest_control$STREET)){
  nest_control$STREET[[i]] <- str_replace(nest_control$STREET[[i]], "(?<=[[:digit:]]) ", "TH ") ## put TH in place of the space after a number
  nest_control$STREET[[i]] <- str_replace(nest_control$STREET[[i]], "(?<=1)TH", "ST ") ## ST after 1 instead of TH
  nest_control$STREET[[i]] <- str_replace(nest_control$STREET[[i]], "(?<=2)TH", "ND ") ## ND after 2 instead of TH
  nest_control$STREET[[i]] <- str_replace(nest_control$STREET[[i]], "(?<=3)TH", "RD ") ## RD after 3 instead of TH
  nest_control$STREET[[i]] <- str_replace_all(nest_control$STREET[[i]], " ", "")
  nest_control$STREET[[i]] <- str_replace_all(nest_control$STREET[[i]], "STREET", "ST")
  nest_control$STREET[[i]] <- str_replace_all(nest_control$STREET[[i]], "EAST", "E")
  nest_control$STREET[[i]] <- str_replace_all(nest_control$STREET[[i]], "WEST", "W")
  nest_control$STREET[[i]] <- str_replace_all(nest_control$STREET[[i]], "SOUTH", "S")
  nest_control$STREET[[i]] <- str_replace_all(nest_control$STREET[[i]], "NORTH", "N")
  nest_control$STREET[[i]] <- str_replace_all(nest_control$STREET[[i]], "AVENUE", "AVE")
  #nest_control$STREET[[i]] <- str_replace_all(nest_control$STREET[[i]], "AVES", "AVE")
}
for(i in 1:length(nest_df$STREET)){
  nest_df$STREET[[i]] <- str_replace_all(nest_df$STREET[[i]], " ", "")
  nest_df$STREET[[i]] <- str_replace_all(nest_df$STREET[[i]], "STREET", "ST")
  nest_df$STREET[[i]] <- str_replace_all(nest_df$STREET[[i]], "EAST", "E")
  nest_df$STREET[[i]] <- str_replace_all(nest_df$STREET[[i]], "WEST", "W")
  nest_df$STREET[[i]] <- str_replace_all(nest_df$STREET[[i]], "SOUTH", "S")
  nest_df$STREET[[i]] <- str_replace_all(nest_df$STREET[[i]], "NORTH", "N")
  nest_df$STREET[[i]] <- str_replace_all(nest_df$STREET[[i]], "AVENUE", "AVE")
  #nest_df$STREET[[i]] <- str_replace_all(nest_df$STREET[[i]], "AVES", "AVE")
}
## now we strip uninformative words from the DBA names, e.g. BAR, RESTAURANT, etc.
for(i in 1:length(nest_control$DBA)){
  nest_control$DBA[[i]] <- str_replace_all(nest_control$DBA[[i]], " ", "")
  nest_control$DBA[[i]] <- str_replace_all(nest_control$DBA[[i]], "&", "")
  nest_control$DBA[[i]] <- str_replace_all(nest_control$DBA[[i]], "THE", "")
  nest_control$DBA[[i]] <- str_replace_all(nest_control$DBA[[i]], "AND", "")
  nest_control$DBA[[i]] <- str_replace_all(nest_control$DBA[[i]], "BAR", "")
  nest_control$DBA[[i]] <- str_replace_all(nest_control$DBA[[i]], "RESTAURANT", "")
  nest_control$DBA[[i]] <- str_replace_all(nest_control$DBA[[i]], "CAFE", "")
  nest_control$DBA[[i]] <- str_replace_all(nest_control$DBA[[i]], "CHEESE", "")
  nest_control$DBA[[i]] <- str_replace_all(nest_control$DBA[[i]], "WINE", "")
  nest_control$DBA[[i]] <- str_replace_all(nest_control$DBA[[i]], "STREET", "ST")
}
for(i in 1:length(nest_df$DBA)){
  nest_df$DBA[[i]] <- str_replace_all(nest_df$DBA[[i]], " ", "")
  nest_df$DBA[[i]] <- str_replace_all(nest_df$DBA[[i]], "&", "")
  nest_df$DBA[[i]] <- str_replace_all(nest_df$DBA[[i]], "THE", "")
  nest_df$DBA[[i]] <- str_replace_all(nest_df$DBA[[i]], "AND", "")
  nest_df$DBA[[i]] <- str_replace_all(nest_df$DBA[[i]], "BAR", "")
  nest_df$DBA[[i]] <- str_replace_all(nest_df$DBA[[i]], "RESTAURANT", "")
  nest_df$DBA[[i]] <- str_replace_all(nest_df$DBA[[i]], "CAFE", "")
  nest_df$DBA[[i]] <- str_replace_all(nest_df$DBA[[i]], "CHEESE", "")
  nest_df$DBA[[i]] <- str_replace_all(nest_df$DBA[[i]], "WINE", "")
  nest_df$DBA[[i]] <- str_replace_all(nest_df$DBA[[i]], "STREET", "ST")
}
The `left_join` command then joins the two datasets:
##left join of the 2 nested data frame
Merged_df <- left_join(nest_df,nest_control,by = c("DBA", "STREET"))
## na.omit of the whole nested data frame
Merged_df <- na.omit(Merged_df)
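For the stubborn one-off exceptions, approximate matching might have recovered a few more restaurants; base R's `agrep()` matches up to a given fraction of edits (a sketch on invented names, not something we actually ran):

```r
# "DOMINIQUE BISTRO" differs from the pattern by one inserted space only
agrep("DOMINIQUEBISTRO",
      c("DOMINIQUE BISTRO", "JOE'S PIZZA"),
      max.distance = 0.2)
# [1] 1  -- index of the near-match
```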
We add the geolocation using the Google geocoding API; as mentioned before, we had to register a key and install a development version of ggmap via devtools. Then we simply loop over the addresses and fetch the coordinates:
register_google(key = "AIzaSyA6LwK5aDH3JTsgpRrt3FQv4G7Jv5siz74")
for(i in 1:length(Merged_df$STREET)){
  Merged_df$location[[i]] <- character()
  Merged_df$location[i] <- str_c(Merged_df$BUILDING.x[i], Merged_df$STREET[i], Merged_df$BORO[i], sep = " ")
}
for(i in 1:length(Merged_df$location)){
  Merged_df$location[[i]] <- geocode(Merged_df$location[[i]])
}
We now save our clean, merged database:
Merged_df <- Merged_df[!duplicated(Merged_df$DBA),] ## remove the name duplications created by left_join
save(Merged_df,file = "Merged_df.RDA")
We first want to know over how much time our data were "collected", so we display the time interval covered by the scraped database and by the hygiene inspection database:
d_range_TA <- Merged_df %>%
unnest(data.x) %>%
mutate(Rate_date = as_date(as.numeric(Rate_date)))
interval(min(d_range_TA$Rate_date), max(d_range_TA$Rate_date)) %>% kable(col.names = "Rating date interval") %>% kable_styling(position='center')
| Rating date interval |
|---|
| 2004-06-04 UTC–2018-03-22 UTC |
d_range_insp <- Merged_df %>%
unnest(data.y) %>%
mutate(`INSPECTION DATE` = mdy(`INSPECTION DATE`)) %>%
filter(year(`INSPECTION DATE`) > 1910)
interval(min(d_range_insp$`INSPECTION DATE`), max(d_range_insp$`INSPECTION DATE`)) %>% kable(col.names = "Inspection date interval") %>% kable_styling(position='center')
| Inspection date interval |
|---|
| 2013-04-11 UTC–2017-08-25 UTC |
The two main categorisations we are going to use in our analysis are price and cuisine. Prices are divided into three categories: cheap, middle price and expensive. We group the data by price and display the number of restaurants in each group. We basically do the same for the cuisine description, except that we had too many factor levels at first, so we decided to collapse some of them (see the code for details).
Merged_df %>% group_by(Price) %>%
dplyr::count() %>%
kable( col.names = c("Price range", "Count"))%>% kable_styling(position='center')
| Price range | Count |
|---|---|
| € | 48 |
| €€ - €€€ | 203 |
| €€€€ | 70 |
Merged_df <- Merged_df %>% mutate(`CUISINE DESCRIPTION` = fct_collapse(`CUISINE DESCRIPTION`,
Asian = c("Asian", "Chinese", "Japanese", "Korean", "Vietnamese/Cambodian/Malaysia", "Thai"),
Italian = c("Italian", "Pizza", "Pizza/Italian"),
European = c("Eastern European", "English", "Greek", "Irish", "Mediterranean", "Portuguese", "Scandinavian", "Spanish"),
Latin = c("Latin (Cuban, Dominican, Puerto Rican, South & Central American)", "Brazilian", "Tex-Mex", "Mexican", "Hawaiian"),
American = c("American", "Barbecue", "Hamburgers", "Steak", "Continental"),
"Middle Eastern" = c("Middle Eastern", "Jewish/Kosher", "Moroccan", "Turkish"),
"Snack & Other" = c("Bagels/Pretzels", "Café/Coffee/Tea", "Delicatessen", "Donuts", "Soups", "Other", "Salads", "Sandwiches", "Sandwiches/Salads/Mixed Buffet")))
Merged_df %>% group_by(`CUISINE DESCRIPTION`) %>%
dplyr::count() %>%
kable( col.names = c("Type of cuisine", "Count"))%>% kable_styling(position='center')
| Type of cuisine | Count |
|---|---|
| American | 115 |
| Asian | 28 |
| Snack & Other | 16 |
| Latin | 13 |
| European | 20 |
| French | 27 |
| Indian | 3 |
| Italian | 75 |
| Middle Eastern | 8 |
| Seafood | 10 |
| Vegetarian | 6 |
We decided to plot some maps for the categories mentioned above. The point was to see whether some "regions" of Manhattan are pricier, or host only restaurants of a single type of cuisine.
# we noticed that 2 restaurants have a wrong address and are therefore badly plotted; we fix them by hand
register_google(key = "AIzaSyA6LwK5aDH3JTsgpRrt3FQv4G7Jv5siz74")
Merged_df$location[[171]] <- geocode("15 Main St, Brooklyn, NY 11201, États-Unis")
Merged_df$location[[189]] <- geocode("229 S 4th St, between Havemeyer St & Roebling St, Brooklyn, NY 11211-5605")
# subset the data by price range:
data_filterlist = list(df_lowprice    = subset(Merged_df, Price == "€ "),
                       df_mediumprice = subset(Merged_df, Price == "€€ - €€€ "),
                       df_highprice   = subset(Merged_df, Price == "€€€€ "))
# group labels for the map's layer control:
layerlist = c("1 cheap", "2 middle price", "3 expensive")
# one color per price range:
colorFactors = colorFactor(c('red', 'orange', 'green'), domain=Merged_df$Price)
icons <- list()
icons [[1]] <- awesomeIcons(
icon = 'ios-close',
iconColor = 'black',
library = 'ion',
markerColor = "red")
icons [[2]] <- awesomeIcons(
icon = 'ios-close',
iconColor = 'black',
library = 'ion',
markerColor = "orange")
icons [[3]] <- awesomeIcons(
icon = 'ios-close',
iconColor = 'black',
library = 'ion',
markerColor = "green")
m <- leaflet() %>% addTiles()
# loop over the price groups; each pass adds that group's markers to the map
for (i in 1:length(data_filterlist)){
  m = addAwesomeMarkers(m,
                        lng = unname(sapply(data_filterlist[[i]]$location, `[[`, 1)),
                        lat = unname(sapply(data_filterlist[[i]]$location, `[[`, 2)),
                        label = data_filterlist[[i]]$DBA,
                        icon = icons[[i]],
                        group = layerlist[i])
}
m = addProviderTiles(m, "Stamen.TonerLite", group = "Toner Lite")
m = addLayersControl(m,
overlayGroups = layerlist,
baseGroups = c("basic","Toner Lite")
)
m
On the map we cannot really see neighbourhoods with, for example, visibly pricier restaurants. If you zoom in closely you can spot some patterns (for instance, restaurants on 5th Avenue are pricier), but maybe we don't have enough data, or Manhattan is simply too big and two streets close to each other can host very different restaurants. This will therefore not help much in our analysis, since it will be hard to group the restaurants into relevant neighbourhoods and then analyse the effect of hygiene inspections on them. It is still fun to play with the interactive map, and it shows us a little more of what we are dealing with: for instance, all our restaurants are concentrated in Manhattan even though they were scraped from the New York page of TripAdvisor. We also wanted to plot the map by type of cuisine, but early versions of such a map led us to the same conclusion, so we dropped it.
We produce a bar plot of the number of restaurants per type of cuisine. It clearly shows four major categories, which will help us later decide which graphs to plot.
Merged_df %>% group_by(`CUISINE DESCRIPTION`) %>% dplyr::count() %>% ungroup() %>%
mutate(`CUISINE DESCRIPTION` = fct_reorder(`CUISINE DESCRIPTION`, n)) %>%
ggplot(aes(x = `CUISINE DESCRIPTION`, y = n)) +
geom_bar(stat = "identity") +
theme(axis.text.x = element_text(angle = 90, vjust = 0.3, hjust = 1), plot.title = element_text(hjust = 0.5)) +
labs(title = "Number of restaurant in each type of cuisine", x = "Type of cuisine", y = "Count")
We now plot the price proportions for each type of cuisine. It is interesting to see that, among the four main categories, French cuisine tends to be pricier, Asian cheaper, and the two others in the middle. This fits the clichéd idea of each cuisine quite well. It also helps us choose which plots to make, and leads us to the idea of plotting time series for the four main categories, with some restaurants from each price range.
Merged_df %>% group_by(`CUISINE DESCRIPTION`) %>% mutate(n = n(), percentage = n/sum(n)) %>%
ggplot(aes(`CUISINE DESCRIPTION`, percentage, fill = Price)) +
geom_bar(stat = "identity") +
theme(axis.text.x = element_text(angle = 90, vjust = 0.3, hjust = 1), plot.title = element_text(hjust = 0.5)) +
scale_y_continuous(labels = scales::percent) +
labs(title = "Price proportion across type of cuisine", x = "Type of cuisine", y = "Proportion")
We first create a column with the number of inspections for each restaurant, and a column counting the inspections flagged as critical:
Merged_df <- Merged_df %>% mutate(nb_inspection = 0, nb_insp_critical = 0)
for (i in 1:nrow(Merged_df)) {
  Merged_df$nb_inspection[[i]] <- nrow(Merged_df$data.y[[i]])
  data.y.i <- Merged_df$data.y[[i]]
  Merged_df$nb_insp_critical[[i]] <- sum(str_count(data.y.i$`CRITICAL FLAG`, "^Critical"))
}
Merged_df <- Merged_df %>% mutate(critical_ratio = nb_insp_critical/nb_inspection)
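The `^` anchor in the pattern matters: the flag column also contains "Not Critical", which an unanchored "Critical" would count as well. A toy check:

```r
library(stringr)

flags <- c("Critical", "Not Critical", "Not Applicable")
sum(str_count(flags, "^Critical"))  # 1 -- only the genuinely critical flag
sum(str_count(flags, "Critical"))   # 2 -- unanchored, "Not Critical" matches too
```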
We can now display the average number of inspections per restaurant and the average critical ratio (i.e. the share of inspections judged not good enough):
Merged_df %>% summarise(avg_nb_inspection = mean(nb_inspection), avg_crit_ratio = mean(critical_ratio)) %>%
kable( col.names = c("Average number of inspection", "Average critical ratio"))%>% kable_styling(position='center')
| Average number of inspection | Average critical ratio |
|---|---|
| 17.04984 | 0.5449974 |
We do the same by type of cuisine:
Merged_df %>% group_by(`CUISINE DESCRIPTION`) %>%
summarise(avg_nb_inspection = mean(nb_inspection), avg_crit_ratio = mean(critical_ratio)) %>%
kable( col.names = c( "Type of cuisine", "Average number of inspection", "Average critical ratio"))%>% kable_styling(position='center')
| Type of cuisine | Average number of inspection | Average critical ratio |
|---|---|---|
| American | 17.23478 | 0.5514914 |
| Asian | 21.71429 | 0.5279837 |
| Snack & Other | 17.00000 | 0.4764363 |
| Latin | 16.00000 | 0.5148083 |
| European | 18.00000 | 0.5603765 |
| French | 12.40741 | 0.5110473 |
| Indian | 16.00000 | 0.5643501 |
| Italian | 16.82667 | 0.5567985 |
| Middle Eastern | 20.62500 | 0.5611786 |
| Seafood | 16.10000 | 0.5812447 |
| Vegetarian | 12.00000 | 0.6105006 |
We were expecting to see some differences in the critical ratio, but what we find here is that no type of cuisine really gets more red flags than the others. This is somewhat logical: it simply means that every restaurant, whatever it cooks, makes hygiene mistakes. We should add that a red flag can be raised for small things, because hygiene controls are taken very seriously. As for the controls that are not critical, we find that they are often follow-ups of a critical one. From the table above we can also see that some types of cuisine get more controls; for instance, Asian restaurants are on average inspected almost twice as often as French ones. This supports our idea that people see Asian cuisine as cheaper than French cuisine.
We then plot this information in a boxplot:
Merged_df %>% group_by(`CUISINE DESCRIPTION`) %>%
ggplot(aes(x = `CUISINE DESCRIPTION`, y = nb_inspection, color = critical_ratio)) +
geom_boxplot() +
theme(axis.text.x = element_text(angle = 90, vjust = 0.3, hjust = 1), plot.title = element_text(hjust = 0.5), plot.subtitle = element_text(hjust = 0.5)) +
labs(title = "Number of inspection by restaurant", subtitle = "Grouped by type of cuisine", x = "Type of cuisine", y = "Number of inspection") +
scale_color_gradient(low = "green", high = "red")
And a jitter plot of the same data, with the critical ratio shown as colour:
Merged_df %>% group_by(`CUISINE DESCRIPTION`) %>%
ggplot(aes(x = `CUISINE DESCRIPTION`, y = nb_inspection, color = critical_ratio)) +
geom_jitter() +
theme(axis.text.x = element_text(angle = 90, vjust = 0.3, hjust = 1),
plot.title = element_text(hjust = 0.5), plot.subtitle = element_text(hjust = 0.5)) +
labs(title = "Number of inspection by restaurant", subtitle = "Grouped by type of cuisine", color = "Critical ratio",
x = "Type of cuisine", y = "Number of inspection") +
scale_color_gradient(low = "green", high = "red")
These two plots give the same information as before, but more clearly. In the last one we can observe the dispersion more precisely. It lets us see that the only pattern/correlation we can find is that a restaurant that is inspected many times tends to have an average critical ratio, while the extreme critical-ratio values belong to restaurants that were not inspected often. This seems logical: when a hygiene control is bad, a follow-up control is made, which pulls the critical ratio towards the average. What is stranger is a restaurant with few controls but a huge critical ratio; what happened there? It is probably just that the controls were flagged critical over a detail while everything else in the kitchen was fine, so only one follow-up control was made. All this information is interesting, but it does not really allow us to conclude anything meaningful for our inferences.
We then fit a linear regression to predict the critical ratio from the number of inspections and the cuisine type. The p-values are really bad, which suggests there is no real correlation in our data, so we will not go further in trying to fit a prediction model.
kable(tidy(lm(Merged_df, formula = critical_ratio ~ nb_inspection + `CUISINE DESCRIPTION` + Price)))%>% kable_styling(position='center')
| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | 0.5284424 | 0.0329110 | 16.0566949 | 0.0000000 |
| nb_inspection | 0.0025466 | 0.0009196 | 2.7691522 | 0.0059622 |
| CUISINE DESCRIPTIONAsian | -0.0347615 | 0.0313714 | -1.1080633 | 0.2687018 |
| CUISINE DESCRIPTIONSnack & Other | -0.0929931 | 0.0441090 | -2.1082572 | 0.0358189 |
| CUISINE DESCRIPTIONLatin | -0.0386181 | 0.0433280 | -0.8912952 | 0.3734688 |
| CUISINE DESCRIPTIONEuropean | 0.0086309 | 0.0359052 | 0.2403797 | 0.8101963 |
| CUISINE DESCRIPTIONFrench | -0.0206245 | 0.0322874 | -0.6387796 | 0.5234426 |
| CUISINE DESCRIPTIONIndian | 0.0204248 | 0.0862554 | 0.2367941 | 0.8129745 |
| CUISINE DESCRIPTIONItalian | 0.0044639 | 0.0219888 | 0.2030083 | 0.8392630 |
| CUISINE DESCRIPTIONMiddle Eastern | -0.0010916 | 0.0540052 | -0.0202129 | 0.9838867 |
| CUISINE DESCRIPTIONSeafood | 0.0384283 | 0.0487866 | 0.7876806 | 0.4314913 |
| CUISINE DESCRIPTIONVegetarian | 0.0699440 | 0.0622480 | 1.1236350 | 0.2620458 |
| Price€€ - €€€ | -0.0184454 | 0.0270913 | -0.6808590 | 0.4964739 |
| Price€€€€ | -0.0388988 | 0.0315924 | -1.2312708 | 0.2191640 |
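Term-by-term p-values are one thing; to judge the model as a whole we could also look at broom::glance(), which reports the R² and the overall F-test p-value. A sketch on the same model:

```r
# Overall fit statistics for the same regression: R squared and the
# F-test p-value summarise the whole model rather than each coefficient.
fit <- lm(critical_ratio ~ nb_inspection + `CUISINE DESCRIPTION` + Price,
          data = Merged_df)
broom::glance(fit) %>%
  select(r.squared, adj.r.squared, p.value) %>%
  kable(col.names = c("R squared", "Adjusted R squared", "F-test p-value")) %>%
  kable_styling(position = 'center')
```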
We then try to see whether there is a correlation between the critical ratio and the price category of a restaurant. Once again there is none, and we see it clearly on the graph:
Merged_df %>% group_by(`CUISINE DESCRIPTION`, Price) %>%
ggplot(aes(x = Price, y = critical_ratio, color = Price)) +
geom_jitter() +
facet_wrap(~`CUISINE DESCRIPTION`) +
geom_point(data = Merged_df %>%
group_by(`CUISINE DESCRIPTION`, Price) %>%
summarise(avg_crit_ratio = mean(critical_ratio)),
aes(x = Price, y = avg_crit_ratio, color = "Mean"))
It is now time to plot the time series of different restaurants. We decided to plot 9 restaurants for each of the 4 main categories (3 of them for each price range). We only kept the critical controls for these plots, since we want to see whether there is a drop in the TripAdvisor rating after a control that was judged critical.
We thus select the restaurants we want; the code speaks for itself:
french <- Merged_df %>%
filter(`CUISINE DESCRIPTION` == "French") %>% group_by(Price) %>%
arrange(desc(critical_ratio)) %>%
filter(row_number() %in% 1:3) %>%
ungroup() %>%
unnest(data.y, .drop = FALSE) %>%
filter(`CRITICAL FLAG` == "Critical") %>%
distinct(`INSPECTION DATE`, .keep_all = TRUE)%>%
select(DBA, Price, `CUISINE DESCRIPTION`, `INSPECTION DATE`) %>%
nest(`INSPECTION DATE`)
french.TA <- Merged_df %>%
filter(`CUISINE DESCRIPTION` == "French") %>% group_by(Price) %>%
arrange(desc(critical_ratio)) %>%
filter(row_number() %in% 1:3) %>%
ungroup() %>%
unnest(data.x, .drop = FALSE) %>%
distinct(Rate_date, .keep_all = TRUE)%>%
select(DBA, Price, `CUISINE DESCRIPTION`, Average,Rate,Rate_date)%>%
nest(Rate,Rate_date,Average)
french <- french %>% mutate(data.x = french.TA$data)
rm(french.TA)
This was only for the French restaurants, but we did exactly the same for the Italian, Asian, and American cuisines.
Now we realise that we need to pad the time series: obviously, TripAdvisor reviews are not posted on a regular basis, since they come from users. We thus use the padr package, which is really handy for cases like this. We first insert a beginning and an ending date, which become the min and max of the series. Then we pad the series, which means first adding a row for each day between those two dates, and then filling each new row with the previous rating average that padr can find. It works well and lets us make the plots afterwards.
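To illustrate the mechanics, here is a toy example on hypothetical data of what pad() followed by fill() does:

```r
library(dplyr)
library(tidyr)
library(lubridate)
library(padr)

# Hypothetical mini series: two ratings four days apart
toy <- tibble(
  Rate_date = as_date(c("2018-01-01", "2018-01-05")),
  Average   = c(4.0, 4.5)
)

padded <- toy %>%
  pad() %>%        # inserts one row per missing day: 2018-01-02 .. 2018-01-04
  fill(Average)    # carries the last known average forward into the new rows

padded
# 5 rows; Average stays at 4.0 on the three inserted days, 4.5 on the last
```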
## set date as date format
for(i in 1: length(french$data)){
french$data[[i]][[1]] <- mdy(french$data[[i]][[1]])}
for(i in 1: length(french$data.x)){
french$data.x[[i]][[2]] <- as_date(as.numeric(french$data.x[[i]][[2]]))}
for(i in 1: length(italian$data)){
italian$data[[i]][[1]] <- mdy(italian$data[[i]][[1]])}
for(i in 1: length(italian$data.x)){
italian$data.x[[i]][[2]] <- as_date(as.numeric(italian$data.x[[i]][[2]]))}
for(i in 1: length(american$data)){
american$data[[i]][[1]] <- mdy(american$data[[i]][[1]])}
for(i in 1: length(american$data.x)){
american$data.x[[i]][[2]] <- as_date(as.numeric(american$data.x[[i]][[2]]))}
for(i in 1: length(asian$data)){
asian$data[[i]][[1]] <- mdy(asian$data[[i]][[1]])}
for(i in 1: length(asian$data.x)){
asian$data.x[[i]][[2]] <- as_date(as.numeric(asian$data.x[[i]][[2]]))}
## now pad
for(i in 1: length(french$data.x)) {
french$data.x[[i]] <- french$data.x[[i]] %>% add_row(Rate_date = as_date("2018-06-01")) %>% add_row(Rate_date = as_date("2000-06-01"))
french$data.x[[i]] <- french$data.x[[i]] %>% pad() %>% tidyr::fill(Average)}
for(i in 1: length(italian$data.x)) {
italian$data.x[[i]] <- italian$data.x[[i]] %>% add_row(Rate_date = as_date("2018-06-01")) %>% add_row(Rate_date = as_date("2000-06-01"))
italian$data.x[[i]] <- italian$data.x[[i]] %>% pad() %>% tidyr::fill(Average)}
for(i in 1: length(american$data.x)) {
american$data.x[[i]] <- american$data.x[[i]] %>% add_row(Rate_date = as_date("2018-06-01")) %>% add_row(Rate_date = as_date("2000-06-01"))
american$data.x[[i]] <- american$data.x[[i]] %>% pad() %>% tidyr::fill(Average)}
for(i in 1: length(asian$data.x)) {
asian$data.x[[i]] <- asian$data.x[[i]] %>% add_row(Rate_date = as_date("2018-06-01")) %>% add_row(Rate_date = as_date("2000-06-01"))
asian$data.x[[i]] <- asian$data.x[[i]] %>% pad() %>% tidyr::fill(Average)}
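The four near-identical date-conversion and padding blocks above could be factored into one helper applied to each cuisine tibble; a sketch, assuming the same nested structure and column positions as in the loops above:

```r
# Sketch: convert the dates and pad the rating series for one cuisine
# tibble, then apply it to all four. Column positions ([[1]], [[2]])
# follow the nesting used above.
prepare_cuisine <- function(df) {
  for (i in seq_along(df$data)) {
    df$data[[i]][[1]]   <- mdy(df$data[[i]][[1]])
    df$data.x[[i]][[2]] <- as_date(as.numeric(df$data.x[[i]][[2]]))
    df$data.x[[i]] <- df$data.x[[i]] %>%
      add_row(Rate_date = as_date("2018-06-01")) %>%
      add_row(Rate_date = as_date("2000-06-01")) %>%
      pad() %>%
      tidyr::fill(Average)
  }
  df
}

french   <- prepare_cuisine(french)
italian  <- prepare_cuisine(italian)
american <- prepare_cuisine(american)
asian    <- prepare_cuisine(asian)
```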
We then plot the time series with plotly. We used that package because it allows us to make groups of traces that can be displayed or hidden for more clarity. We show the code for the French cuisine plot; the others are similar, except that the French cuisine has only seven restaurants instead of nine, since we could not find three cheap ones.
for (i in 1:length(french$data.x)) {
assign(paste0("hovertext", i), paste0("Name:<b>", french$DBA[[i]], "</b><br>","Date:<b>", french$data.x[[i]][[2]], "</b><br>",
"Rating :<b>", french$data.x[[i]][[3]], "</b><br>"))
}
for (i in 1:length(french$data.x)) {
assign(paste0("ypos", i), french$data.x[[i]][[3]][french$data.x[[i]][[2]] %in% french$data[[i]][[1]]])
}
p <- plot_ly(french) %>%
add_lines(x = french$data.x[[1]][[2]],
y = french$data.x[[1]][[3]],
line = list(color = "#00526d", width = 2),
hoverinfo = "text", text = hovertext1, name = french$DBA[1],
legendgroup = "groupmiddle",
showlegend = T) %>%
add_markers(x = sort(french$data[[1]][[1]]),
y = ypos1,
marker = list(size = 10, color = "#00526d"),
showlegend = F,
legendgroup = "groupmiddle")%>%
add_lines(x = french$data.x[[2]][[2]],
y = french$data.x[[2]][[3]],
line = list(color = "green", width = 2),
hoverinfo = "text", text = hovertext2, name = french$DBA[2],
legendgroup = "grouphigh",
showlegend = T) %>%
add_markers(x = sort(french$data[[2]][[1]]),
y = ypos2,
marker = list(size = 10, color = "green"),
showlegend = F,
legendgroup = "grouphigh")%>%
add_lines(x = french$data.x[[3]][[2]],
y = french$data.x[[3]][[3]],
line = list(color = "green", width = 2),
hoverinfo = "text", text = hovertext3, name = french$DBA[3],
legendgroup = "grouphigh",
showlegend = T) %>%
add_markers(x = sort(french$data[[3]][[1]]),
y = ypos3,
marker = list(size = 10, color = "green"),
showlegend = F,
legendgroup = "grouphigh")%>%
add_lines(x = french$data.x[[4]][[2]],
y = french$data.x[[4]][[3]],
line = list(color = "#00526d", width = 2),
hoverinfo = "text", text = hovertext4, name = french$DBA[4],
legendgroup = "groupmiddle",
showlegend = T) %>%
add_markers(x = sort(french$data[[4]][[1]]),
y = ypos4,
marker = list(size = 10, color = "#00526d"),
showlegend = F,
legendgroup = "groupmiddle")%>%
add_lines(x = french$data.x[[5]][[2]],
y = french$data.x[[5]][[3]],
line = list(color = "green", width = 2),
hoverinfo = "text", text = hovertext5, name = french$DBA[5],
legendgroup = "grouphigh",
showlegend = T) %>%
add_markers(x = sort(french$data[[5]][[1]]),
y = ypos5,
marker = list(size = 10, color = "green"),
showlegend = F,
legendgroup = "grouphigh")%>%
add_lines(x = french$data.x[[6]][[2]],
y = french$data.x[[6]][[3]],
line = list(color = "#00526d", width = 2),
hoverinfo = "text", text = hovertext6, name = french$DBA[6],
legendgroup = "groupmiddle",
showlegend = T) %>%
add_markers(x = sort(french$data[[6]][[1]]),
y = ypos6,
marker = list(size = 10, color = "#00526d"),
showlegend = F,
legendgroup = "groupmiddle")%>%
add_lines(x = french$data.x[[7]][[2]],
y = french$data.x[[7]][[3]],
line = list(color = "red", width = 2),
hoverinfo = "text", text = hovertext7, name = french$DBA[7],
legendgroup = "grouplow",
showlegend = T) %>%
add_markers(x = sort(french$data[[7]][[1]]),
y = ypos7,
marker = list(size = 10, color = "red"),
showlegend = F,
legendgroup = "grouplow")%>%
layout(
title = "Time series of french cuisine",
legend = list(x = 100, y = 0.5),
xaxis = list(
rangeselector = list(
buttons = list(
list(
count = 5,
label = "5 yr",
step = "year",
stepmode = "backward"),
list(
count = 3,
label = "3 yr",
step = "year",
stepmode = "backward"),
list(
count = 2,
label = "2 yr",
step = "year",
stepmode = "backward"),
list(
count = 1,
label = "1 yr",
step = "year",
stepmode = "backward"),
list(step = "all"))),
rangeslider = list(type = "date")),
yaxis = list(title = "Rating"),
xaxis = list(title = "Date"))
p
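As a design note, the seven (elsewhere nine) add_lines()/add_markers() pairs above are almost identical and could be generated in a loop. A sketch, where the price_level() lookup mapping each Price label to a colour/legend group is a hypothetical helper we introduce:

```r
# Sketch: build the same traces in a loop instead of copy-pasting one
# add_lines()/add_markers() pair per restaurant. price_level() is a
# hypothetical helper; adapt it to the actual Price factor levels.
colors <- c("red", "#00526d", "green")            # low / middle / high price
groups <- c("grouplow", "groupmiddle", "grouphigh")
price_level <- function(p) match(p, sort(unique(french$Price)))

p <- plot_ly()
for (i in seq_along(french$data.x)) {
  g     <- price_level(french$Price[[i]])
  dates <- french$data.x[[i]][[2]]
  rates <- french$data.x[[i]][[3]]
  p <- p %>%
    add_lines(x = dates, y = rates,
              line = list(color = colors[g], width = 2),
              name = french$DBA[[i]], legendgroup = groups[g],
              showlegend = TRUE) %>%
    add_markers(x = sort(french$data[[i]][[1]]),
                y = rates[dates %in% french$data[[i]][[1]]],
                marker = list(size = 10, color = colors[g]),
                showlegend = FALSE, legendgroup = groups[g])
}
```

The layout() call with the range selector and slider can then be chained onto p exactly as above.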